Camera Calibration from Dynamic Silhouettes Using Motion Barcodes
Computing the epipolar geometry between cameras with very different
viewpoints is often problematic as matching points are hard to find. In these
cases, it has been proposed to use information from dynamic objects in the
scene for suggesting point and line correspondences.
We propose a speedup of about two orders of magnitude, as well as an
increase in robustness and accuracy, for methods that compute epipolar
geometry from dynamic silhouettes. This improvement is based on a new
temporal signature: the motion barcode for lines. A line's motion barcode
is a binary temporal sequence indicating, for each frame, whether at least
one foreground pixel lies on that line. The motion barcodes of two
corresponding epipolar lines are very similar, so the search for
corresponding epipolar lines can be restricted to lines with similar
barcodes. The use of motion barcodes leads to increased speed, accuracy,
and robustness in computing the epipolar geometry.
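As a concrete illustration, a motion barcode and a similarity score between two barcodes could be computed roughly as in the NumPy sketch below. The function names, the pixel-distance test for "on that line", and the choice of normalized cross-correlation as the similarity measure are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def motion_barcode(fg_masks, line, thickness=1.0):
    """Binary temporal signature of an image line: entry t is 1 if, in
    frame t, at least one foreground pixel lies within `thickness` pixels
    of the line ax + by + c = 0, and 0 otherwise.

    fg_masks: (T, H, W) boolean array of per-frame foreground masks.
    line:     (a, b, c) coefficients of the image line.
    """
    a, b, c = line
    T, H, W = fg_masks.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Point-to-line distance, computed once for the whole pixel grid.
    on_line = np.abs(a * xs + b * ys + c) / np.hypot(a, b) <= thickness
    return np.array([(fg_masks[t] & on_line).any() for t in range(T)],
                    dtype=np.uint8)

def barcode_similarity(b1, b2):
    """Normalized cross-correlation of two equal-length barcodes; a high
    score suggests the lines may be a corresponding epipolar pair."""
    b1 = b1 - b1.mean()
    b2 = b2 - b2.mean()
    denom = np.linalg.norm(b1) * np.linalg.norm(b2)
    return float(b1 @ b2 / denom) if denom > 0 else 0.0
```

With such a signature, candidate epipolar-line pairs can be pre-filtered by thresholding the similarity score before any geometric verification, which is where the claimed speedup would come from.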
SceneScape: Text-Driven Consistent Scene Generation
We present a method for text-driven perpetual view generation -- synthesizing
long-term videos of diverse scenes given only an input text prompt describing
the scene and the camera poses. We introduce a novel framework that
generates such videos in an online fashion by combining the generative power of
a pre-trained text-to-image model with the geometric priors learned by a
pre-trained monocular depth prediction model. To tackle the pivotal challenge
of achieving 3D consistency, i.e., synthesizing videos that depict
geometrically-plausible scenes, we deploy online test-time training that
encourages the predicted depth map of the current frame to be geometrically
consistent with the synthesized scene. The depth maps are used to build a
unified mesh representation of the scene, which is progressively extended as
the video is generated. In contrast to previous works, which are applicable
only to limited domains, our method generates diverse scenes, such as
walkthroughs in spaceships, caves, or ice castles.
Comment: Project page: https://scenescape.github.io
Text2LIVE: Text-Driven Layered Image and Video Editing
We present a method for zero-shot, text-driven appearance manipulation in
natural images and videos. Given an input image or video and a target text
prompt, our goal is to edit the appearance of existing objects (e.g., an
object's texture) or augment the scene with visual effects (e.g., smoke, fire)
in a semantically meaningful manner. We train a generator on an internal
dataset of training examples extracted from a single input (an image or video
together with a target text prompt), while leveraging an external pre-trained
CLIP model to establish
our losses. Rather than directly generating the edited output, our key idea is
to generate an edit layer (color+opacity) that is composited over the original
input. This allows us to constrain the generation process and maintain high
fidelity to the original input via novel text-driven losses that are applied
directly to the edit layer. Our method neither relies on a pre-trained
generator nor requires user-provided edit masks. We demonstrate localized,
semantic edits on high-resolution natural images and videos across a variety of
objects and scenes.
Comment: Project page: https://text2live.github.io
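At composition time, the layered formulation reduces to standard alpha blending of the generated edit layer over the input. A minimal PyTorch sketch follows; the function name and tensor layout are illustrative assumptions, and the CLIP-based losses applied to the edit layer are not shown.

```python
import torch

def composite_edit_layer(image, edit_rgb, edit_alpha):
    """Alpha-composite a generated edit layer over the original input.
    All tensors are assumed to hold values in [0, 1].

    image:      (3, H, W) original image or video frame.
    edit_rgb:   (3, H, W) color channels of the generated edit layer.
    edit_alpha: (1, H, W) opacity of the edit layer (broadcast over RGB).
    """
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * image
```

One property of this formulation: wherever the generated opacity is zero, the output is pixel-identical to the input by construction, which is what lets losses on the edit layer itself constrain the edit to stay localized and faithful to the original.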